arXiv:1509.01275vl [stat.AP] 3 Sep 2015 


Modeling Long-term Ontcomes and Treatment Effects After Androgen 
Deprivation Therapy for Prostate Cancer 

Yolanda Hagar, James J. Dignam, and Vanja Dukic* 


Abstract 

Analyzing outcomes in long-term cancer survivor studies can be complex. The effects of predictors on the failure 
process may be difficult to assess over longer periods of time, as the commonly used assumption of proportionality 
of hazards holding over an extended period is often questionable. In this manuscript, we compare seven different 
survival models that estimate the hazard rate and the effects of proportional and non-proportional covariates. 
In particular, we focus on an extension of the the multi-resolution hazard (MRH) estimator, combining a non¬ 
proportional hierarchical MRH approach with a data-driven pruning algorithm that allows for computational 
efficiency and produces robust estimates even in times of few observed failures. Using data from a large-scale ran¬ 
domized prostate cancer clinical trial, we examine patterns of biochemical failure and estimate the time-varying 
effects of androgen deprivation therapy treatment and other covariates. We compare the impact of different mod¬ 
eling strategies and smoothness assumptions on the estimated treatment effect. Our results show that the benefits 
of treatment diminish over time, possibly with implications for future treatment protocols. 
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1 Introduction 


Many human cancers today are considered “chronic diseases”, with long-term disease trajectories and multiple co¬ 
morbidities. Consequently, long-term cancer outcomes may be affected by numerous factors, ranging from obvious 
patient and treatment characteristics to secular and health care trends that affect treatment policy and practice. 
However, the long-term nature of patient and health-care related processes and changing complexity of information 
can make the analysis of long-term patterns in cancer survivor data sets challenging. In addition to sparsely observed 
failure times, these data often exhibit non-proportional effects over time, requiring flexible and computationally 
efficient statistical methods to characterize the evolving failure hazard. 

The motivating problem in this article comes from a set of prostate cancer clinical trials from the Radiation 
Therapy Oncology Group (RTOG), which are specifically designed to examine the effects of the length of androgen 
deprivation (AD) therapy on disease-free and overall survival time. As a result of the insight gained from these studies 
over the years, the short-term be nefits of AD therapy have b een well-understood to delay the time until prostate 
cancer recurrence and until death ( Pileoich et ah . 2001. 2005ll. In ad dition, longer-duration AD therapy has proven 
more beneficial than a shorter-duration therapy ( Horwitz et al.l . [ioOSfl . However, given that AD treatments can have 
unpleasant side-effects, clinicians have been reluctant to assign androgen deprivation for longer than necessary. 

Many questions remain regardin g the relationship bet ween the length of AD therapy and long-term outcomes, 
such as eventual time to recurrence ( Schroder et al.l. 2012). These o pen questions still exist in part because prostate 
cancer is generally a slow cancer to progress ( Albertsen et ah . 2005ll . While prostate cancer patients tend to survive 
longer and are thus observed over extended periods of time, treatment benefits have been difficult to precisely 
ascertain due to a multitude of co-morbidities and sparsity of information at long follow-up times. 

Thus, assessing the long-term benefits of different duration of AD therapy for patients in diffe rent risk classes 


woul d be of great value to clinical practice and management of prostate cancer patients in general ([Schroder et al 


20121) . Gaining insight into recurrence patterns and quantifying the degree and length of long-term benefits over 
time would greatly improve the quality of life for men with this disease. For this reason, the focus of our paper 
goes beyond integrated summaries such as survival curves and cumulative incidence functions, and concentrates on 
estimation and inference about the time-dynamic hazard function in the presence of covariates, and the time-evolving 
predictive probabilities of disease recurrence. 

The underlying statistical approach employed in this paper is an extension of the multi-res olution hazard (MRH 
mode l, a Bayesian semi-p a rame tr ic hazard rate estim at or previous l y pres ented and used in Bouman et al.l ( 200__ 
2007 ). IPukic and Dignain ( 2007 ). Dignam et al.l ( 20091) . Ghen et m] ( 2014I ). and iHagar et al. ( 2014a[) . This flexible 


i 


class of models for time-to-event data is based o n the Polya tree met hodology, and is also similar in spirit to the 
adaptive piece-wise constant exponential models ( Ibrahim et al.l . EoOll ). The MRH parametrization is designed for 
multi-resolution inference capable of accommodating periods of sparse events and varying smoothness, typical in 
long-term studies. In addition, the MRH model accommodates both proportional and n on-proportional eff ects of 
predictors over time. The current methodology employs the pruning algorithm presented in IChen et al.l (|2014l) . which 
performs adaptive and data-driven “pre-smoothing” of the hazard rate, via merging of time intervals with similar 
hazard levels. Pruning has been shown to increase computational efficiency and reduce over all uncertainty in h azard 


20141) . All 


rate estimation in the presence of periods with smooth haz ard rate and low ev ent counts (IChen et al 
MRH models have been fitted using the ‘MRH” R package ( Hagar et al.l . l2014b ). 

This paper is organized as follows. In the next section, we provide a short overview of the prostate cancer clinical 
trials data, and the statistical issues related to these studies with long-term follow-up. Sections 3 and 4 present 
the corresponding MRH methodology and implementation. Section 5 presents the analysis of biochemical failure 
in prostate cancer, with comparisons of the MRH approach to a set of alternative models: the Gox proportional 
hazards model, an extended Cox model that includes a time-varying treatment effect, a non-proportional hazards 
Weibull parametric model, a semi-parametric Bayesian accelerated failure time model, a dependent Dirichlet Process 
survival model, and two piece-wise exponential models. In addition we perform a sensitivity analysis to the priors in 
the MRH model. The article concludes with a discussion of the clinical and statistical importance and implications 
of our findings. 
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2 Motivating Example: Androgen Deprivation in Prostate Cancer 


Typical prostate treatment involves radiation therapy combined with some form and duration of hormone treatment, 
which is known as androgen deprivation (AD) therapy. The motivation for the current analysis is the characterization 
of the hazard rate of time-to-biochemical failure, adjusted for the (potentially) time-varying effects of AD therapy 
and several key covariates, as described below. 

The outcome of interest in this analysis, “biochemical failure”, is defined according to the Phoenix defin i tion a s 
a two-unit rise in prostate specific antigen (PSA) level following a post-treatment PSA nadir ( Roach et ah . 2006[l . 
Prostate specific antigen is a glycoprotein produced almost solely by prostatic epithelial cells, and is a biomarker 
routinely measured to screen for possible presence of prostate cancer. Men with prostatic diseases (including cancer) 
can have high serum PSA levels due to structural changes in the prostate gland as well as to the enhanced production 
of PSA; therefore, elevated levels have long been used as a possible indication of the presence of p r ostate cancer 
inclu ding residual or recurrent disease after treatment ( Cooner et al ], Il990t ICatalona et al.L Il99ll : iBrawer et al 


2001 


1992) . Although recent studies que s tion P SA as a screening method for initial prostate cancer diagnoses (jBarrv 


iThompson et ah . 2004 : Mover . 2012 ). the examination of the rise in PSA levels post-cancer treatment is still 


considered by many to be a useful clinical tool for assessing the risk or presence of prostate cancer recurrence. 

The rises in PSA levels can lead to what is termed “biochemical failure”, which in itself is not currently con¬ 
sidered a clinical endpoint. However, biochemical failure is thought to importantly portend advancing (and possi- 
bly sub-clinical) disease. Prostate cancer mortality risk might also be affected by patterns in biochemical failures 
( Sartor et al.l . ll99RlD’Amico et al.l . l2003HBuvvounouski et al.l . l2008[) over time. However, because of its lack of direct 
clinical consequences, its use as a primary endpoint in clinical trials has been controversial. 

Nonetheless, characterization of the biochemical failure hazard over time, particularly within different patient 
subgroups defined by disease characteristics or treatment regimens, would provide a strong foundation for determining 
how this endpoint may relate to the levels of risk for clinical recurrence and death. A better understanding of these 
recurrence patterns over time could be of great value for clinical management, design of clinical trials, and biologic 
insights into prostate cancer progression in different population subgroups. 

The data we use to analyze biochemical failure hazard come from the Radiation Therapy Oncology Group 
(RTOG), which is a national clinical cooperative group that has been funded by the National Cancer Institute 
(NCI) since 1968 in an effort to increase survival and improve the quality of life for cancer patients. The group 
consists of both clinical and laboratory investigators from over 360 institutions across the United States and Canada 
and includes in its membership nearly 90% of all NCI-designated comprehensive and clinical cancer centers. The 
specific RTOG clinical trial we examine in this paper is RTOG 92-02, which is part of a series of RTOG clinical 
trials conducted from the 1980s to the present. These rich studies provide a wealth of data sources for studying the 
“natural history” of prostate cancer as it is presently defined and managed clinically. 

RTOG 92-02 was a multi-center study, designed with the primary objective of evaluating the effectiveness of 
androgen deprivation therapy on prostate cancer disease progression and survival. Between 1992 and 1995, 1,521 
participants with locally advanced high risk prostate cancer were accrued in over 200 treatment centers across the 
country. During the trial, all patients received 4 months of androgen deprivation (AD) therapy including goserelin 
and flutamide, in addition to external beam radiation therapy. Subjects were then randomized to either no further 
AD therapy (the “-|-0m AD group”), or an add i tiona l 24 months of goserelin (the “-|-24m AD group”) using the 
treatment allocation scheme described bv IZelen ( 19741) . and were stratihed according to stage, pretreatment PSA, 
grade, and nodal status. Given RTOG ’s long history of high quality, well-randomi zed clinical trials with strictly 
executed protocols in each institution ( Radiation Therapy Oncology Groin] . l2014b[) . analyses of the pooled data 
(wi thout considering c enter h eterogeneity) hav e been dominan t in p reviou s analyses of these trials (for example, 
see IChakravarti et al.l ( 2007 ). Ghe et al. ( 20071) . iHorwitz et al.l ( 2008 ). and Roach et al.l ( 2007)), and our an alysis 
here follows suit. Further protocol details and study description can also be found in Horwitz et al.l ( 20081) and 
Radiation Therapy Oncology Grouni ( 2014a ). 

For each patient enrolled, several measures of aggressiveness and severity of the original cancer were recorded at 
baseline: the Gleason score, T-stage of the tumor, and the PSA level at diagnosis. The Gleason score is assigned by 
a pathologist after microscopic examination of a tumor biopsy. Based on the degree to which the prostate cells have 
become altered, a Gleason score ranging from 2-10 is assigned, with scores between 2 and 4 indicating almost normal 
cells that pose little danger, and scores above 8 indicating very abnormal cells and a cancer that could be aggressive 
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Table 1: Sample characteristics of 1,421 patients in the final RTOG 92-02 trial data set, by treatment group, with 
age at diagnosis in 10-year increments, Gleason scores categorized by grade, and T-stage categorized into levels 2-4. 



-l-Om AD 

-|-24m AD 

Total Sample 




% 

Patients 

705 49.6 

716 50.4 

1421 100.0 

Less than 60 years 

A„ 60 to 70- years 

° 70 to 80- years 

80 or more years 

44 3.1 

274 19.3 

363 25.5 

24 1.7 

51 3.6 

260 18.3 

371 26.1 

34 2.4 

95 6.7 

534 37.6 

734 51.7 

58 4.1 

Low grade (2-4) 

Gleason score Intermediate grade (5-7) 
High grade (8-10) 

56 3.9 

462 32.5 

187 13.2 

47 3.3 

495 34.8 

174 12.2 

103 7.2 

957 67.3 

361 25.4 

T2 

T-stage T3 

T4 

325 22.9 

360 25.3 

20 1.4 

331 23.3 

353 24.8 

32 2.3 

656 46.2 

713 50.2 

52 3.7 


( Epstein et ah . 2005[) . 


The American Joint Gommittee on Cancer (AJCC) staging criteria is used to assign the tumor a T-stage, which 
indicates the extent that the primary tumor has spread. (In this analysis, we omit the ‘N’ and ‘M’ components of 
AJCC staging, as they are only applicable to non-localiz ed cancer cases). Because all patients in the RTOG 92-02 
trial were selected as “high-risk” by pre-specified criteria (IRadiation Therapy Oncology Groud . l2014ah . our data set 
only contains men with tumors of stage two (T2) through four (T4). Stage T2 indicates that the tumor can be 
felt during a physical examination, but has not spread outside the prostate, stage T3 indicates that the tumor has 
spread throu ghout the prostate (or the “prostatic capsule”), and stage T4 indicat es the tumor has spread beyond 
the prostate ( American Joint Committee on Canceii 2014 : Held-Warmkessel , 2006 1. 

In conjunction with the Gleason score and the T-stage, PSA levels at the time of diagnosis are an important 
component of prostate cancer staging, with very high levels frequently thought to be associated with a more severe 
form of prostate cancer. Since the Gleason score, T-stage, and PSA level are all important components of the cancer 
severity at the time of diagnosis, they, in addition to the age at diagnosis, will be considered as predictors in the 
biochemical failure analysis. 

The final data set considered in this analysis comprises 1,421 subjects, after the removal of 100 subjects with 
missing Gleason scores. Of those 1,421 subjects, 705 men (49.6%) received no additional AD therapy (were placed 
in the “J-Om AD therapy” group) and 716 men (50.4%) received additional 24 months of AD therapy (the “-|-24m 
AD therapy” group). The sample median time to biochemical failure is 4.9 years (SD = 3.9, range = 0.03 - 13.65). 
Biochemical failure was observed for 50.4% of the patients before the end of the study period. The sample median 
age at baseline was 70 years (SD = 6.5, range = 43-88). Table [T] summarizes the sample characteristics in more 
detail. 


3 Multi-resolution Hazard Model 


The hazard function of the time to biochemical failure T is defined as h{t) = limA-s-o P{t <t-|-A|r>t)/A = 
f{t)/S{t), where S{t) = P(T > t) is the survival function and f{t) = —S'{t) is the probability density function of T. 
While the hazard rate can provide a more detailed pattern over time that is not always visible in aggregate measures 
such as the survival curve or cumulative hazard function, it may also be more difficult to estimate reliably. This is 
particularly true if event counts are sparse, as is often the case in studies that follow subjects for extended periods 
of time. 

199^. They vary 


Various statistical estimators have been developed for the hazard function (jAndersen et al 


from classic parametric methods that assume a known family of failure time distributions such as Exponential or 
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Weibull , to sem i- or non-parametric smoothing methods such as those in Gray ( 199Cll . 1992 . 1996), Hess (1994), and 
Sargent! (119971) , t o methods u sing process priors in the context of n on-parametric Bayes ian hazard estimation. In 


this latter group, Hiort ( 1990h introduces the beta process prior and Lee and Kim ( 2004 ) deyelop a computational 
algorithm t o appr oxima te a beta proces s by generating a sample path from a compound Poisson process. Similarly, 
Kalbfieiscb (1978) and Burridgj ( 1981) model the cumulative hazard function as a gamma process. Correlated 


process priors, s uch as those used by Arias an d Gasbarral (Il994l) . rely on a martingale jump process to model the 
hazard rate. In Nieto-Baraias and Walkeil ( 2002 ). the prior correlation is introduced via a latent Poisson process 
be tween two adjacent ha zard increments. Other non - param etric Bayes hazard rate estimation methods are reviewed 
in Sinha and Dev ( 1997 ) and Muller and Rodriguez ( 20131) . 


Di fferen t predictors and covariates can be included in hazard modeling under the proportional hazards assumption 


( Cox . Il972l : llbrahim et al.l . 2001 : Bouman et ah . 2007 ). However, with longer-term follow-up, the assumption of 


constant proportionality between the hazard rates for different patient subgroups may be questionable, either because 
the effects change throughout the course of a study, or because the remaining patients constitute a subpopulation 
significantly distinct from the population at baseline. In these instances, it is important to relax the proportionality 
of hazards assumption over time. 

One of the simplest ways to accommodate non-proportionality of hazard functions among groups of patients is 
to perform a stratified analysis and estimate each group’s hazard function separately. However, this simple method 
cannot jointly estimate the hazard rates and effects of predictors, nor allow for correct quantification of uncertainty. 
M ore soph i sticated inv e stigations involve t he us e of p iece-wise hazar d functions, examples of which can be found 


HolfordI ( 19761 1980l) . Laird and Olivier ( 1981 ). and Taulbeel ( 19791) . Other methods address non-proportionality 


issues by pre-testing and comparison of two survival or hazard functions through graphics and asymptotic conhdence 
bands ( Dabrowska et al.l . 1989t Parzen et al. . 1997 : McKeague and Zhaol 200^. or through asymptotic confidence 


nptotic cc 

i^. l2012[) . 


Some of 


bands for changes in the predictor effects over time ( Wei and SchaubeT2008t Dong and Matthews 
these methods mentioned require large sample sizes for inference, and their performance can degrade over time in 
studies with sparser outcomes in later periods. Alternative approaches to handling non -proportionality have been 
implemented through accelerate d failure time (AFT) models (w ith initial work done by Bncklev and Jani^ ( 1979t ) 
and a thorough review found in Kalbfleisch and Prentic ^ ( 20021) ). which accommodate non-proportionality through 
specific parameterization of the time to survival and the covariates. As a result, these models can only accommodate 
certain types of non-proportionality, such as non-proportional hazards that do not cross. 


In the Bayesian context, non-proportionality has been addressed in a nu mber of ways; Berry et al.l (120041) es¬ 


timate piece-wise constant hazard rates for each of the treatment groups, and Nieto-Baraias (2^^~ejrtends a fre- 
quentist model by Yang and Prenti^ ( 2005h that estimates short and long term hazard ratios for crossing hazards. 


Hennerfeind et al.l (|2006i) have developed a survival model tha t incorporates functio ns for time-varying covariate 


effects into the hazard rate using Bayesian P-spline priors, while Cai and Meveil ( 2011) d evelop a Bayesian stratified 
proportional hazards model using a mixture of B-splines. Further. iPe lorio et al.l ( 20091 ) use a dependent Dirichlet 
Process (DDP) to non-parametrically estimate non-proportional hazards. We compare and discuss many of these 
models in Section 021 

The MRH model is closely related to the Polya tree ( Fergusonl . Il974 iLavinel 1992 ). The Polya-tree prior is an 
infinite, recursive, dyadic partitioning of a measurable space H. (Although in practice this process is terminated 
at a finite level M, resulting in “finite” Polya trees.) Polya trees h ave been adapted for mode li ng survival dat a 


in a numb e r of w ays (for example, see Muliere and Walkerl ( 19971): iH anson and Joh nson ( 2002 ): Hanson (2006); 


Zhao et all ( 20091) ). A stratified Polya tree prior is developed by Zhao and Hanson ( 2011 ), with the tree centered at 
the log-logist ic parametric family. The “bins” of the Polya tree are fused together in the “optional Polya tree” (OPT), 
developed bv lWong and Ma ( 2010l) . with fusing performed through randomized partitioning of the measurable space 
and a variable that allows for t he stopping of the partitioning i n different subregions of the tree. The “rubbery 
Polya tree” (rPT), developed by Nieto-Baraias and Miillei ( 2012h . smooths the partitioned subsets by allowing the 
branching probabilities to be dependent within the same level, implemented through a latent binomial random 
variable. 

The MRH prior is a type of Polya tree; it uses a fixed, pre-specified partition, and controls the hazard level within 
each bin through a multi-resolution parameterization. This parameterization allows parameters to differ across bins 
and levels of the tree in such a way that the marginal priors at higher levels of the tree are the same, regardless 
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of the priors at lower levels of the tree. The MRH model is capable of producing group-specific non-proportional 
hazards estimates (in the presence of proportional hazards covariates), allows for a data-driven fusing of bins (called 
“pruning”), and includes parameters that can control the smoothness and correlation between intervals. 


3.1 MRH Methodology for Mon-proportional Hazards (NPMRH) 

The basic MRH was extended in IPukic and Dismainl (j2007ll into the hierarchical multi-resolution (HMRH) hazard 
model, capable of modeling non-proportional hazard rates in different subgroups jointly with other proportional 
predictor effects. Like HM RH, the methodol ogy in this paper allows for group-specific hazard functions, but adds 
the pruning methodology of IChen et all ( 2014 1 for individual hazard rates. The pruning algorithm detects consecutive 
time intervals where failure patterns are statistically similar, increasing estimator efficiency and reducing computing 
time. The resulting method produces computationally stable and efficient infe rence, even i n per iods with sparse 
numbers of failures, as may be the case in studies with long follow-up periods ( Chen et al. . l2014ll . Details of the 
MRH prior and the pruning method are found in Appendix A. 


3.1.1 NPMRH Likelihood Function 

The biochemical failure time data is subject to right-censoring: a patient’s biochemical failure time is considered 
right-censored if it has not been observed before the end of the study period (time tj), or the censoring time (teens, 
where teens < tj). We denote Ti as the minimum of the observed time to biochemical failure or the right-censoring 
time for subject i. Each patient belongs to one of the C covariate (for example a combination of treatments) strata, 
and within each stratum we employ the proportional hazards assumption such that: 

htit \X,P) = /ibase,f(t)exp{X'/3}. 


Here, hi denotes the baseline hazard rate for treatment strata X represents the z x ni matrix of z covariates 
(other than those used for stratification) for the ni patients in the stratum £, while /3 denotes the z x 1 vector of the 
covariate effects. 

For subject i in stratum i who has an observed failure time at £ [0,tj), the likelihood contribution is: 

where Xij is that subject’s covariate vector, and Sbase,e is the baseline survival function for the stratum i. Similarly, 
for a subject in stratum i whose failure time is greater than the censoring time, the likelihood contribution is: 


Thus, the likelihood for all n patients in all C strata together (n = J2t=i i® 


L(T I ;9,H,R^,p,X) = n n [hbaseAT^,i)e^'-‘^] S^,aseAT^,i) 


i=l ieSe 




where Si denotes the set of indices for subjects belonging to the stratum £, and 6i^i = 1 if subject i had observed 
biochemical failure, and 0 if censored. In this model, the C hazard rates are estimated jointly with all the covariate 
effects. The estimation algorithm is performed two steps: the pruning step and the Gibbs sampler routine. Details 
are provided in Appendix B. 


3.1.2 Inference for Non-proportional Effects 

A non-proportional covariate effect can be described as the log of the hazard ratio between different covariate strata 
in each bin. For simplicity, let us assume that the model has only one non-proportional effect predictor with i 
categories - for example, i treatment groups. Let ai^i+i{t) denote the hazard effect of treatment group i + \ with 
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respect to treatment group i. Then can be thought of as a time-varying effect, which is constant within 

each time bin, but changes across time bins. In other words, 

— to ),..., — tj-i)] 

= log ^ [di^i+i/di ^£,..., dj^i+i/dj^i\ ^ , 

where dj/,j = 1,..., J,£ = represents the hazard increment in the bin of the strata, and is defined as 

dj = ^ h{s)ds = (see Appendix A for details). For the biochemical failure model, the time-varying 

effect of treatment is thus: 

00,1 (0 = log ^[di,i/di,o, ■ • ■, dj.i/dyo] ^ , 

where 0 represents the short-term androgen deprivation therapy group, and 1 represents the long-term androgen 
deprivation therapy group. The marginal posterior distribution of this time-varying effect of treatment can therefore 
be obtained directly from the joint posterior distribution for the hazard increments d. 


4 Analysis of RTOG Prostate Cancer Clinical Trial Data 


The main goal of our analyses of the time biochemical failure was to infer how the effects of AD therapy impacted 
the failure time, and if and how they may have changed over the course of treatment and the subsequent 10 year 
post-diagnosis follow-up. Of particular interest was to assess whether the benefit of longer over shorter duration 
androgen deprivation (AD) therapy was persistent over time or if it diminished in the late follow-up. Additionally 
of interest was whether the biochemical failure hazard rate for the long-duration AD (-|-24m) group increased later 
in follow-up, indicating that failures were deferred in time rather than avoided. Inference in the later time periods 
is more challenging however, as the RTOG 92-02 clinical trial exhibits sparsity of events towards the end of the 
follow-up time as is typical of long-term studies. Fewer observed biochemical failures occurred in the later periods, 
with only 13% of the subjects having observed biochemical failure after 4.9 years (the median time to biochemical 
failure), and only 1.5% after 10 years. 

Several previous studies have used time aggregated summaries (i . e., survival curves , cum ulative incidence) to 
estimate cumulative biochemical failure risk over time ( Nguven et ah . [20131 : iTaira et ah . 2013). However, previous 
analyses specifically examining the hazard rate of biochemical failure are relatively scarce. These analyses have been 
limited to the clinical literature and used intuitive summaries to appr oximate the annual hazard, for example, by 


calculating the annual number of events div ided by the number at risk ( Amling et ah . 2000l : Dillioglugil et ah . 1997t 


Hanlon and Hanfl , 2000t Walz et ah . 2008). These analyses provide basic, useful information on the patterns of 
hazard of biochemical failure over time, and have helped identify higher and lower periods of hazard for specific 
patient groups. However, the straightforward methods used do not provide smoothed estimates of the hazard rate 
over time, and the joint estimation of the effects of multiple covariates on hazards was not considered. In addition, 
during periods of time when no biochemical failures were observed, these estimation procedures calculated the hazard 
rate to be zero. 

Thus, the analyses here investigate the effects of treatment, age, Gleason scores, PSA levels, and T-stage at diag¬ 
nosis on the time to biochemical failure, allowing for possible non-proportional treatment effects. In order to provide 
a thorough investigation of the treatment hazard ratio over time and to determine the effects of different modeling 
and smoothness assumptions on the estimate, we present a suite of models ranging from simple parametric models 
to more complex non-parametric models. In addition to the standard and pruned MR H models, w e also employ a 


param etric non-proportional hazards Weibull model, two piece-wise exponential models ( IZelenl. 1974; Ibrahim et al 
2001 1. an extended Gox model that allows for time-varying cov ariate effects ( Martinussen and Scheike . 2006^ . a 


Dependent Diric hlet Process survival model (De lorio et al. . 2009ll . and a semi-parametric Bayesian accelerated fail- 
ure time model (Komarek and Lesaffrel . 1200711 . All MRH models were implemented using the “MRH” R package 
( Hagar et al. . l2014hh . 

The following covariates were examined in the models: 
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• Treatment (+0m vs +24m AD therapy) 

• Age (categorized: less than 60 years, 60 to 70- years, 70 to 80- years, and 80 years or older) 

• Gleason scores (categorized: low grade, intermediate grade, and high grade, corresponding to scores between 
2-4, 5-7, and 8-10, respectively) 

• PSA levels (log transformed and then centered at the mean log value equal to 3) 

• T-stage (binary: stage 2, or stage 3/4) 

The baseline (reference) group comprises subjects who received no additional androgen deprivation therapy (-fOm), 
had an intermediate Gleason score, were below age 60 at study entry, and had a T-stage equal to 2. 


4.1 MRH Results 


The time resolution M = 6 was chosen for this analysis in order to provide a fine grain examination of biochemical 
failure patterns over the course of more than 13 years. The resulting J = 64 time intervals, partitioning the time axis 
into bins of length 2.5 months, allowed us to investigate detail in the biochemical failure hazard rate that is useful 
to clinical practice. The full 64-bin MRH model with non-proportional treatment hazards (the “NPMRH-0” model) 
was compared to two pruned MRH models with non-proportional treatment hazard rates, one with all 6 levels of the 
MRH tree pruned (“NPMRH-6”), and one with only the bottom 3 levels of the MRH tree pruned (“NPMRH-3”). 
Pruning the 64-bin model allowed us to fuse bins where the failure rates were statistically similar (reducing the 
number of model parameters), which in turn helped us identify periods of time where the hazard rates were flat and 
where treatment effects remained steady. To test whether the non-proportionality of hazards was indeed warranted 
in the model, we also examined a pruned MRH model (a 6-level model with the bottom 3 levels subject to pruning) 
with the treatment effect included under the proportional hazards assumption (PHMRH). 

Five separate Markov chain Monte Carlo (MCMC) chains were run for each model, each with the burn-in 
of 50,000, leaving a total of 1 50,000 thinned iterations in each chain for analysis. Converge nce was determined 
through the Geweke diagn o stic ( Gewekel . [l9^ . graphical diagnostics, and Gehnan-Rubin tests ( Gelman and Rubinl . 


1992t iBrooks and Gelmanl . Il998f) . These diagnostics are automatically presented when using the MRH R package 


(IHagar et al.l . l2014bl) . Point estimates for the MRH models were calculated as the median of the marginal posterior 


distribution of each parameter. Central credible intervals were used for inference. 

One of the most notable features of our results are in the estimated log-ratio of the treatment effect, particularly 
in the pruned models. In examining Figure [U we note the interval-specific differences displayed in the caterpillar 
plots (top 3 plots in Figure[T|). These plots reveal several important pieces of information about the treatment effects, 
including time periods where the treatment effects were: 1) proportional (constant) or non-proportional (changing), 
2) statistically significantly different from previous periods, and 3) statistically significantly different from zero or 
from the proportional hazards model estimate of the treatment effect. For example, in the NPMRH-3 model, the 
treatment effects remained steady between (approximately): 6 ms-2 years, 3.5-4 years, 5-6.5 years, 6.5-8 years, 
and 8-10 years. However, these periods of constant estimated treatment effects were different from one another, 
suggesting that while the benefits of treatment lasted for a certain number of years, the degree of improvement 
changed (and generally declined) over the course of the study. Between 6 months and 2 years, long-term AD therapy 
had an estimated 75% improvement over short term AD therapy. In examining the 95% bounds of the boxplots, this 
estimated log-hazard ratio is statistically different than the log-ratio estimate from the proportional hazards model 
of Ptx = —0.597 (which translates to 45% improvement for the -|-24m group). Additionally, the estimated log-ratio 
in this time period is statistically significantly different than the estimated treatment effects between 5-6.5 and 8-10 
years, which only showed an estimated 26% improvement for subjects on long-term therapy (see Figure [21 bottom). 
In both pruned models, the treatment effect held steady for a certain number of years, then diminished slightly, and 
held steady for another number of years, before diminishing in effectiveness again. Overall, long-term AD therapy 
did better in prolonging time to biochemical failure throughout most of the first 10 years of the study, despite the 
fact that in certain periods the log-ratio is not statistically significantly different from zero. 

The results of all the MRH models provide two important insights: 1) The proportional hazards assumption 
indeed did not hold for treatment effects (agreeing with the Cox model test), and there were in fact periods of time 
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Figure ll Comparison of parameter estimates produced by the different MRH models. TOP: Caterpillar plots of the treatment log-hazard 
ratio for the three non-proportional MRH models with different levels of pruning. The model on the top has no pruning (NPMRH-0), 
and shows the most variation between and within bins. As the level of pruning increases (NPMRH-3 and NPMRH-6), the uncertainty of 
the estimates decreases. In addition, the pruned models communicate the patterns in the log-hazard ratio more clearly: larger treatment 
differences are visible at the beginning and end of the study, with long periods of stability in the middle. All caterpillar plots have 
two reference horizontal lines: the grey line crosses the y-axis at 0, and the blue line at -0.597 (the estimate of the treatment effects 
under the proportional hazards using the the PHMRH model). BOTTOM LEFT: Covariate effect estimates and corresponding credible 
bounds for the MRH and NPMRH models. The figure shows that the estimated covariate effects are very similar among all models, 
regardless of the level of pruning or the treatment effect proportionality assumption. BOTTOM RIGHT: The smoothed log hazard ratio 
of the long-duration AD (-|-24m) group to the short-duration (H-Om) group for NPMRH-0, NPMRH-3, and NPMRH-6 models, contrasted 
against the estimate of treatment effect under the proportional hazards assumption (obtained from PHMRH). The solid lines represent 
the log-hazard ratio estimate, and the dashed lines represent the smoothed point-wise 95% credible intervals. The hazard rate estimates 
are very similar among the three models. However, the two models that have pruned trees have narrower credible intervals, as they have 
a smaller number of estimated parameters. Note that in all models, the credible bands become large towards the end of the study, due to 
the decreasing frequency of observed biochemical failures as time progresses. However, the pruned models show lower variability towards 
the end of the study period, as the per-bin observed failure count is higher in those models. 
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Coefhcient 

P 

HR 

95% Cl for P 

Gleason score g 

-0.11 

0.30 

0.90 

1.35 

(-0.43, 0.18) 
(0.13, 0.46) 

Age 60-70- 

^ 70 - SO¬ 

SO or older 

-0.12 

-0.44 

-0.07 

0.88 

0.64 

0.94 

(-0.39, 0.16) 
(-0.71, -0.16) 
(-0.52, 0.37) 

T-stage 3 or 4 

0.14 

1.15 

(-0.01, 0.29) 

log (PSA), centered 

0.29 

1.34 

(0.21, 0.37) 


Table 2: Estimated covariate effects for the NPMRH-3 model. Baseline subjects are those who have a Gleason score between 5 and 7, 
are less than 60 years of age, and have a T-stage equal to 2. The 0 column contains the estimates of the log of the hazard ratios, and the 
HR column contains the coresponding estimated hazard ratios. Model estimates for all the different MRH (with and without pruning) 
are very similar, as can also be observed in Figure [T] 


where the estimated effects are statistically significantly different from each other, and 2) On average, the subjects 
on +24m of AD therapy experienced benefits for at least 10 years post treatment. 

In Figure[T] (bottom right), we present the smoothed version of the caterpillar plots above, illustrating the overlap 
of the credible regions around the estimated log-hazard ratio for the four different models. The smoothing was done 
using a cubic smoothing spline (with 5 degrees of freedom, 53 knots, and a smoothing parameter equal to 0.82), 
which was implemented via the smooth, spline() function in R. While the caterpillar plots are useful for identifying 
specific interval differences in the treatment effect, these smoothed plots emphasize the difference in the uncertainty 
among the models and the different shapes of both the estimated effects and their credible intervals. For example, we 
see that among the NPMRH models, the unpruned model (NPMRH-0) has the widest credible interval bands, while 
the fully pruned model (NPMRH-6) has the narrowest credible interval bands, which is due to the smaller number 
of estimated parameters and larger failure counts per bin in the pruned model. While the PHMRH model clearly 
has the narrowest credible region, the constant parameter estimate cannot identify periods of increased or decreased 
long-term treatment benefit. This discrepancy is particularly visible in the last third of the study, where the benefits 
of long-term treatment seem to be decreasing. 

The estimates and their 95% credible intervals for the time-invariant effects (effects of age, Gleason scores, PSA 
measures, and T-stage) are almost identical among all the MRH models, and are shown in the “cat-scratch” plot in 
Figure [H bottom left. 

In all models, estimates of the biochemical failure hazard rate for each treatment group showed an increase in the 
first two to four years, with a steady decline afterwards (see Figure [2l upper left). However, subjects who received 24 
months of additional AD therapy had a lower hazard rate than those who did not, with a flatter peak between 2 and 
4 years. The non-proportionality between the hazards is particularly visible when compared to the results from the 
proportional hazards model. While both the NPMRH-3 and PHMRH models show similar estimated hazard rates 
for the -|-0m AD therapy group, the estimated hazard rates for the -|-24m group had significant departures in the 
first four to five years of the study, as well as in the last two years of the study. It does appear that, while long-term 
treatment effects diminished over time, biochemical failure was not simply postponed for the -|-24m group, but the 
risk was in fact reduced even over a longer period of time. 

Time-invariant effect estimates show that an increase in Gleason scores was associated with an increased hazard 
rate, with a statistically significant difference between baseline subjects and subjects with scores greater than 8 (HR 
= 1.35, 95% GI: 1.14, 1.59). The hazard of biochemical failure decreased with age, although significant differences 
were only observed for subjects between 70 and 80 years old and baseline subjects (HR = 0.64, 95% Cl: 0.49, 0.86). 
As expected, subjects with a T-stage of 3 or 4 had a higher hazard of biochemical failure compared to subjects with 
a T-stage equal to 2 (HR = 1.15 , 95% Cl: 0.99, 1.34). Similarly, for every point increase in PSA scores on the log 
scale (a 2.7 factor increase in PSA measures on the standard PSA scale), there was a statistically significant 34% 
increase in the hazard rate. (See Tabled Figure [T] bottom left.) 

The probability of biochemical failure at 1, 5, and 10 years can be observed in Figure |3l which shows the smooth 
posterior predictive probability densities of biochemical failure, stratified by treatment type for hypothetical subjects 
with a“worst” or “best” covariate profile. A subject with a “worst” profile had a Gleason score > 8, a T-stage 3 or 4 
tumor, and a PSA score equal to 1 standard deviation greater than the mean (PSA ss 52). A subject with a “best” 
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Figure 2: TOP LEFT: Smoothed estimated hazard rates (baseline subjects only) for the H-24m AD therapy group (red) compared to 
the +0m AD therapy group (black). The hazard rates estimated under the non-proportional assumption are represented with solid lines, 
and the hazard rates estimated under the proportional assumption are represented with dashed lines (calculating the +24m hazard rate 
at time t as /io(t) exp{/3ta;}). While the estimated hazard rate for the +0m AD therapy group is similar under both the proportional 
and non-proportional modeling assumptions, the -|-24m hazard rate estimates have larger departures, with a flatter 2-year peak for the 
estimate from the non-proportional hazards model. TOP RIGHT: Smoothed estimated hazard rates (baseline subjects only) and 95% 
credible interval bounds for the -t-24m AD therapy group (red) compared to the H-Om AD therapy group (black). The intervals are 
slightly narrower for the -h24m treatment group when compared to the H-Om treatment group, although the credible intervals for the 
-)-24m estimated hazard rate become wider at the tail end of the end of the study where few failures are observed. BOTTOM: A caterpillar 
plot of the effects of long-term vs short-term treatment over time. The grey line lies on the y-axis at 0, and the blue line lies on the y-axis 
at -0.597, which is the estimate of the treatment effect under the proportional hazards setting (estimate from the PHMRH model). It 
is particularly apparent at the beginning of the study that the proportional hazard rate estimate for treatment is not contained in the 
boxplot bounds. In addition, we can see that the boxplot medians have a lot of variation, and even change from negative to positive 
multiple times throughout the course of the study. All estimates shown are from the NPMRH-3 model. 
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Figure 3: Smoothed posterior predictive densities of biochemical failure at 1, 5 and 10 years post-diagnosis, stratified by treatment type 
and hypothetical patient covariate profile. A subject with a “worst” profile had a Gleason score > 8, a T-stage equal to 3 or 4, and a 
PSA score equal to 1 standard deviation greater than the mean (PSA ^ 52). A subject with a “best” profile had a Gleason score < 4, a 
T-stage equal to 2, and a PSA score equal to 1 standard deviation below the mean (PSA 8). At one year post-diagnosis, the predictive 
densities of biochemical failure were very similar for all groups. However, at 5 and 10 years the densities became more spread out. A 
worst profile subject on H-Om of AD therapy had the highest predictive probability of biochemical failure, while a best profile subject on 
-|-24m AD therapy had the lowest predictive probability of biochemical failure. While all densities overlapped at one year, at 5 and 10 
years very little overlap remained between the best and worst profile within the same treatment regimen. In addition, at all three time 
points, the predictive probability of biochemical failure was higher for the worst profile subjects (regardless of treatment) than the best 
profile subjects. Smoothed density estimates were calculated using densityO in R. 


profile had a Gleason score < 4, a T-stage equal to 2, and a PSA score equal to 1 standard deviation below the mean 
(PSA Ri 8). At one year post diagnosis, we see that the posterior predictive densities were very similar among the 
four groups, all concentrated between 0 and 20%. However, by the 5-year post-diagnosis mark, the failure densities 
were very different. A worst profile subject on -|-0m AD therapy had failure probability centering around 80%, while 
a best profile subject on -|-24m of AD therapy had failure probability centering around 20%. It can also be observed 
that a worst profile subject had higher failure probability than a best profile subject, regardless of treatment type. 
The failure probability at 10 years post-diagnosis is perhaps the most telling, with a worst profile subject on -|-0m 
AD therapy having a failure probability ranging from 80-100%, which is a narrower range when compared to the 
other groups. Meanwhile, a best profile subject on -|-24m therapy had failure rates centering around 40%, over a 
wider interval from approximately 20% to 60%. While all posterior predictive densities overlapped at one year, at 5 
and 10 years there was only a small amount of overlap between the best and worst profile subjects within the same 
treatment regimen. 


4.2 Model Checking and Comparison 


To assess the impact of different modeling and smoothness assumptions on the hazard of time to biochemical failure, 
we compared the four MRH models to each other as well as to other models, including the Cox proportional hazards 
model, a parametric non-proportional hazards Weibull model, two piece-wise exponential (PE) models, a dependent 
Dirichlet Process (DDP) survival model, and a semi-parametric Bayesian accelerated failure time (AFT) model, 
allowing for time-varying treatment effects in all models. In addition, we performed a sens itivity analysis to the 
parameter k, which controls the smoothness in the MRH tree prior ( Bouman et al.l ( 20051) . see Appendix A for 
details). When applicable, models were compa red through a goodne ss-of-fit measu re (defined in Section 14.2.3L as 


well as via information c riteria including BIG ( Schwar j. 1978i) . AIC ( Akaike . 1974 ). and DIG ( Soiegelhalter et al 
I 2 OO 2 I: iGeleux et al.l. l2006l) . 


12 



































0 

E 

cS 

0 


o 

-»—< 
cCl ' 








4 6 8 10 

Time 


Figure 4: Schoenfeld residuals for the treatment effect, 
with a weighted least squares line. Deviation from linearity 
indicates that the treatment effect does not fit the propor¬ 
tional hazards assumption. 


Cox Proportional Hazards Model 

Because the Cox proportional hazards model is widely used in the an alysis of sur vival data, we included this model 
as a comparison to other non-proportional hazards model for contrast (|Coxl . ll972l) . While we modeled the treatment 
effect under the proportional hazards assump tion, it is important to note that the Schoenfeld residuals and methods 
presented bv lGrambsch and Therneau ( 1994h showed evidence that the treatment effect (long-term versus short-term 
therapy) was not proportional over the entire study period (see Figure |4]). No other covariate effects showed evidence 
of non-proportionality over time. 


Cox Model Extension 

In addition to the traditional Cox pro portional hazards model, we inc luded an extended proportional hazards model 
with a time-varying treatment effect (jMartinussen and Scheika 2006). This extended Cox model has a hazard rate 


with the form 


X[t) = Y (t)Ao(t) exp{W^(t)/3(t) -h 


( 1 ) 


where {X{t),Z{t)) is a (p-|-g)-dimensional covariate, /3(t) = (/?i(t),...,/3p(t)) is a p—dimensional time-varying (i.e. 
NPH) regression coefficient that is estimated non-parametrically, and 7 is the q-dimensional regression parameter 
for the PH covaria te effects. An implementation of this model was performed using the timecoxO function in the 
“timereg” package (jScheikel . 120141 1. where parameters are estimated using score equations, and the optimal smoothing 
parameter was chosen based on the lowest -2*log-likelihood value and the lowest GOF scores. 


Accelerated Failure Time Model 

Accelerated failure time models are an al ternative way to invest i gate t he effects of non-proportional hazards. In our 
analysis, we used a Bayesian AFT model (IKomarek and Lesaffrd . 120071) . This model can accommodate more complex 
clustered, interval-censored survival data, with the log of the survival times is modeled as: 

log(Ti,i) = + b[zi^i + i = l,...,N,l = l,...,ni, 

where Ti^i is the event time of the observation of the cluster. The model estimates the effects, /3 = (/?i,..., /3p), 
of the fixed effects Xi^i and the i.i.d random effects bi = ( 6 ^ 7 ,..., bi^qf". The fixed and ran d om co variate effects are 


wx 1.x.xx X cxxxvav-»xxx v^xxv.v^uo ■ 5 • -t- cxxxxx xctxx v-x yxxx vcxxxc-xuv^ v^xxv^v^uo cxx 

modeled using the classical Bayesian linear mixed model approach (such as Gelman et al. ( 2004h '). and the hazard 
rate is approximated by normal mixtures. For the purposes of the prostate data, we omit any clustering effects. 


Dependent Dirichlet Process Survival Model 

As the most flexible alternative, we also consider a non-parametric Bayesian model that can accommodate non¬ 
proportional hazards. This model is based on the ANOVA Dependent Dirichlet Process (DDP) model presented 
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in De lorio et al. ( 2004li . that has been extended to the survival analysis setting ( De lorio et al. . 2009ll . The DDP 
survival model performs survival regression based on a Dirichlet Process prior. 

The set of the random probability distributions or functions are dependent in an ANOVA-type fashion: If {Fa;, a; S 
V} is the set of random distributions indexed by the categorical covariates x = (xi, ... ,Xp), and the collection of 
random distributions is defined as 

OO 

Fx = '^phSiOxh), for each x G X, 

h=l 

with Yl’h^iPh = 1 ) Sijj) representing a p oint mass at y, then de pend ence is introduced by modeling the locations 


Oxh through the covariates (as explained in iDe lorio et al.l (|2004r) and iDe lorio et al.l (|2009ll l. This model allows 


all group-specific hazards to be modeled non-proportionally, and covariate effects are interpreted in the standard 
ANOVA manner. The model is also capable of accommodating continuous covariates. 

Because of the greater flexibility with the DDP survival model, this model is often unable to estimate all desired 
covariate effects. As a result, we present this model on a reduced set of variables that were found to be significant 
in other models: treatment, a high Gleason score, age between 70 and 80 years, and log(PSA). 


Non-Proportional Hazards Weibull Model 

The non-proportional effects Weibull model was designed with separate Weibull hazard rates for each treatment 
group and proportional hazard covariate effects shared among both treatment groups. Parameter estimates for this 
model were obtained using numerical optimization of the likelihood function: 


L : 


n 

e=i 


(^KiXe ^exp|v{^/3|^ ’ exp exp 




where /3 are the covariate effects modeled under the proportional hazards assumption. The estimate of the log-hazard 
ratio of the non-proportional effect of treatment at time t > 0 in the Weibull model was then obtained as 


aw{t) 


( KiAi (Ait)”^ ^ \ 

l^^oAo (Xotr-^J ’ 


where group 0 is the short-term treatment group, and group 1 is the long-term treatment group. The non-proportional 
hazards Weibull model parameter estimation was not performed using any available software packages, but is available 
on request from the authors. 


Piece-wise Exponential Models 

The piece-wise exponential (PE) model is a commonly u sed frequentist s emi-parametric model for joint estimation 
of the hazard rate and covariate effects (for example, see iFriedmanI ( 198211 1. It is similar to the MRH model in that 
both assume constant hazard rates within a time bin j (j = 1,..., J), but it does not have the multi-resolution 
aspects of MRH. 

As with the non-proportional hazards Weibull model, we fit a PE model with separate hazard rates for each 
treatment group, and shared proportional hazards effects among all subjects. If we let Xjj represent the constant 
hazard rate in the bin for the treatment group (j = 1,..., J, and £ = 1,2), then the piece-wise exponential 
likelihood can be written as: 


where 


n z J ^ 

L(T|^,X,<5,A) = nnn (^Xj^e exp{X'/3}^ exp Aj,^ exp{X';0}|, 

i=l£=lj=l 


r f 1 if subject i is in treatment group £ and has a failure in time bin 7 
= I 0 otherwise, 
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{ tjj — tj-i/ if subject i is in treatment group ^ and Ti/ > 

Ti — if subject i is in treatment group i and Ti g € 

0 otherwise. 

To make the PE model comparable with the pruned MRH models, we use a data-driven method to select the 
optimal number of bins J, as well as the optimal bin width(s). In addition, we modify the standard PE approach 
slightly in order to overcome a common obstacle in the estimation of the variance. Namely, given that the Fisher 
Information for the hazard rate in bin j for group £ can be derived as: 


Hhi) 




bins with no observed failures will yield I{Xj g) of zero, making the Fisher Information matrix singular. In such 
instances, we have remedied this issue by (repeated) merging of the bins with no observed failures into the adjacent 
bins to the left. With that modification, for each of the hazard rates, w e hnd the PE model with the optimal number 
of bins and bin widths based on an information criterion such as AIC ( Akaikel . ll974ll . in two ways: 


1. Equal-bin model: The “equal-bin” PE model partitions the time axis evenly into j bins (where j = 2,..., J). 
Among the J — 1 equal-bin PE models, we retain the model that has the lowest AIC value. 

2. Quantile-bin model: The “quantile-bin” PE model partitions the time axis into j quantiles {j = 2,..., J). 
Among the J — 1 quantile-bin PE models, we retain the model that has the lowest AIC value. 


In the RTOG 92-02 data set, the final equal-bin PE model had 17 bins for the -|-0m treatment group hazard rate 
and 18 bins for the -|-24m treatment group hazard rate. The last three bins were combined for the -l-Om treatment 
group, and the last two bins had to be combined for the -|-24m treatment group hazard rate. (Note that as the 
result, not all bins were of equal length due to the combined bins at the end of the study). The final quantile-bin 
model had 24 bins for the -l-Om treatment group hazard rate and 25 bins for the -|-24m treatment group hazard rate. 
The last three bins were combined for the -l-Om treatment group, and the last two bins were combined for the -|-24m 
treatment group. 


4.2.1 Comparison of Estimated Hazard Ratio and Predictor Effects 

The estimates from all models are compared visually in Figures [6] and [5l All models were remarkably similar in terms 
of the proportional hazard covariate effects, both in point estimates and their 95% bounds (credible intervals for the 
MRH model, and confidence intervals for the remaining models), as can be seen in Figure [51 Given this similarity, 
we refer the readers to Table [2] and Section 14.11 for interpretation and discussion of these effects in the NPMRH-3 
model. (Note that the estimated covariate effects for the AFT and DDP survival models are not shown as their 
number and interpretation are different than the other models.) 

In contrast, the estimates of the time-varying treatment effect show notable differences (Figure |6|). The PE and 
MRH models provide similar estimates, although the PE model log-HR estimates exhibit a rapid increase towards 
the end of the study when the number of observed failures becomes sparse (top right graph). Due to its parametric 
form, the NPH Weibull model has an initial dip in the estimated treatment effect, and then slowly but steadily 
increases over the course of the study, although the estimated effects remain negative throughout the study period. 
The extended Cox model shows a similar pattern and estimate, without the initial dip (top left graph). The AFT 
estimated log-HR follows a trajectory similar to that of the NPH Weibull model, including the wider 95% confidence 
interval bounds in the initial study period (bottom left graph). The DDP survival model (calculated on the subset 
of significant predictors only) shows an initial pattern similar to that of the NPMRH-3 model (also calculated on 
the subset of significant predictors only), with a dip at the two year mark, followed by an upward trend. The DDP 
survival model is the only model that shows a possible decreasing trend towards the end of the study. For comparison, 
the Cox PH model treatment estimate (included under the proportional hazards assumption), has been included in 
all graphs as a constant value over time {/3 = —0.59, 95% Cl: -0.74, -0.44). It can be observed that throughout 
various periods of the study, all models have estimated treatment effects that extend outside the 95% confidence 
interval for the Cox PH model treatment effect. In addition, all models show that long-term treatment is beneficial 
over longer periods of time, even if the effects may be diminishing. 
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Figure 5: The estimates for the proportional hazards covariate effects and corresponding 95% intervals (credible intervals for NPMRH-3, 
confidence intervals for the remaining models). Differences are minor, even among the widths of the 95% intervals, between the different 
models. The AFT and DDP survival model estimates are not included as their number and interpretation is different than the standard 
Cox PH covariate interpretation. 


4.2.2 Sensitivity Analysis to Parameter k in the MRH Models 

In all the MRH mod els, the parameter k controls the correlation among the hazard increments within each bin 
(|Bouman et al.l (|2005l) . see Appendix A for details). The default value for k in the above analyses was 0.5, which 


implies zero a priori correlation among the hazard increments. However, when k > 0.5, the increments are positively 
correlated a priori, and, similarly, when k < 0.5, the hazard increments are negatively correlated a priori. Another 
way to understand the impact of k is that higher values lead to smoother hazard functions. 

In practice, different approaches to choosing a hyperprior fo r k, including empiric al Bayes methods, are possible. 


However, k will in general tend to dep end on the resoluti on level (jBouman et al.l . 120051) . as well as with the significance 
level used in the pruning algorithm ( Chen et al.l . 120141) . Both the resolution and the pruning can be also used to 


imply the desired a priori level of smoothness of the hazard function. For this reason, we fix k in the above analyses, 
and perform a sensitivity analysis to examine the effect the choice of k might have on the posterior hazard rate 
estimates. We only examine the effects of different values of k in the 3-level pruned MRH model (NPMRH-3) (see 
Subsection 14.2.31 for motivation.) 

The sensitivity analysis results are displayed in Figure [T] On the left plot in Figure [7l the original NPMRH-3 
model (with k = 0.5) is contrasted against the models with negatively correlated hazard increments (fc = 0.2), and 
positively correlated hazard increments {k = 1.0). As anticipated, in the negatively correlated model the log-HR 
is less smooth, and has wider 95% credible intervals, resembling the PE model results. However, the NPMRH-3 
model with fc = 1.0 is smoother, with narrower 95% credible intervals. The positive correlation between hazard 
increments results in smoother posterior estimates, as more information is shared across bins. The right graph of 
Figure [7] highlights the adaptability of the MRH model in controlling the smoothness of the log-hazard ratio through 
the parameter k. In this instance, with k fixed at a very high value of 10 (highly positively correlated increments), 
the NPMRH-3 model closely mimics the results of the parametric NPH Weibull model. 
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Figure 6: Comparison of the estimated treatment log-HR from the different models. TOP: The estimated log-hazard ratio of the -\-2Am 
AD therapy effects over time, with 95% bounds (smoothed point-wise credible intervals for the MRH model, and point-wise confidence 
intervals for the remaining models). In the top left graph, it can be observed that after the initial dip in the NPH Weibull model estimate, 
the log-ratio slowly but steadily increases over the course of the study. The 95% confidence interval bounds for the log-HR of the NPH 
Weibull model are much narrower than most of the other models, which is expected due to dramatically fewer parameters estimated 
in that model. The extended Cox model shows the same upward trajectory, although the initial dip is not pronounced, which is likely 
due to the choice of the smoothing parameter. The top right graph shows similarities between the log-HR estimates for the MRH and 
adjusted PE models, although the PE models have a sharper upward trend towards the very end of the study. BOTTOM: The estimated 
log-hazard ratio of the -|-24m AD therapy effects over time, with 95% bounds (smoothed point-wise credible intervals for the MRH model, 
and approximated point-wise confidence intervals for the AFT and DDP survival models). On the left, the AFT model shows a similar 
pattern to the NPH Weibull model, with an initial dip at the beginning of the study, followed by an increasing estimate over time. On 
the right, the DDP survival model (calculated on the subset of significant predictors), is contrasted against the NPMRH-3 model (also 
calculated on the subset of significant predictors), where they show a similar pattern, with an initial log-HR estimate greater than the 
Cox PH estimate, followed by a dip at 2 years. Unlike the other models, the DDP survival model estimated log-HR treatment effect 
decreases slightly towards the end of the study. For contrast, the Cox PH model estimated treatment effect and 95% confidence interval 
is shown in all figures. All models have periods where the estimated treatment log-HR extends outside the 95% confidence interval for 
the Cox PH model treatment effect. The estimated hazard rate and 95% credible interval bounds for the AFT mod el were predicted for 
all covariate groups using the bayessurvregl() function found in the “bayesSurv” package in R jKoma^ a 120151), and were predicted 
for the DDP survival model using the LDDPsurvivalO function in the “DPpackage” package in R~ l|Jara et al.L l2012li . In both models, 
the bounds were used to approximate the point-wise variance of the hazard rates, which were then used to approximate the point-wise 
variance of the log-HR. 
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Figure 7: LEFT: Comparison of the smoothed, estimated log-hazard ratios for the NPMRH-3 model with k values equal to 0.2, 0.5 (the 
default MRH model, denoted with a and 1.0, with estimates shown in black and smoothed point-wise 95% credible interval bounds 
shown in grey. The model with k = 0.2 is the least smooth, with the largest 95% credible interval bounds. Conversely, the model with 
A: = 1 is smoother, with narrower 95% credible interval bounds, as the positive correlation allows more “shared” information between 
the hazard increments. RIGHT: The estimated log-hazard ratio for the NPMRH-3 model with k = 10 contrasted with the NPH Weibull 
model from the previous section. This figure highlights the adaptability of the MRH model; if k is fixed at a large value (making the 
hazard rate quite smooth), the NPMRH-3 model closely mimics the results of the Weibull NPH model. Note that the two figures do not 
have the same y-axis scale. 


4.2.3 Model Performance Comparison 

In addition to model parameter comparisons in Subsection l4.2.11 the set of models were also compared based on their 
goodness of fit (GOF), as well as several information criteria. The GOF was evaluated using the following simple 
measure over time: 


GOF{t) 


1 

-E 

"•fcr 


|/i{biochemical failure occurs > t} — P(subject i experiences biochemical failure > f)|, 


where | ■ | denotes absolute value, {biochemical failure occurs > t} is an indicator variable which equals 1 if the 
subject i fails after time t and equals 0 otherwise, and P(subject i experiences biochemical failure > t) is the model- 
based probability of the subject i experiencing biochemical failure. This probability is found based on the estimated 
model parameters (posterior medians, or maximum likelihood estimates) and covariates for subject i. Patients who 
were censored before time t were not included in the GOF calculation at time t. Therefore, nt represents the total 
number of patients in the cohort minus the number of patients censored before time t, so that the maximum value 
the GOF statistic can take is 1. In other words, the GOF measure calculates the average difference between the 
observed failure time and the probability of failure at that time point. Lower GOF values indicate more accurate 
failure approximations and a better fitting model. 

Results from the GOF statistic calculations are shown in Figure [H Most models show very similar results and 
trajectories, with exception for both of the adjusted PE models and the extended Gox model. The adjusted PE model 
with equal bins has the worst survival prediction initially, followed by the extended Gox model and the adjusted PE 
model with bins determined through quantiles. After four years, the extended Cox model has the highest GOF of 
all models. Differences between the MRH models (including those with different values of the prior parameter k) are 
negligible, and also very similar to the results for the NPH Weibull and AFT models. Note that the GOF statistic 
was not calculated for the Cox proportional hazards model, as no estimate of the hazard rate is typically produced 
by Cox models. That statistic was also not calculated for the DDP survival model, as subject-specific survival curves 
are not provided in that package. 

Table [3] shows several information criteria (AIC, BIG, and DIG where appropriate) for all the models considered 
(with the exception of the DDP survival model, as subject-specific hazard rates and survival curves are not provided 
in that package). Among the MRH models, the PHMRH model has the highest DIG value, which is about 5000 
points greater than any of the NPMRH models. It also has the highest negative log-likelihood, BIG and AIC values, 
despite the smaller number of parameters when compared to the NPMRH models. This is consistent with our earlier 
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Figure 8: LEFT: Goodness-of-fit values across time for both adjusted PE models, the NPH Weibull model, the extended Cox model, 
and the AFT model. CENTER: Goodness-of-fit values across time for the four MRH models (PHMRH, NPMRH-0, NPMRH-3, and 
NPMRH-6). RIGHT: Goodness-of-fit values across time for the NPMRH-3 model with different fixed k values equal to 0.2, 0.5, 1.0 and 
10. The placed by A: = 0.5 denotes that this is the original NPMRH-3 model reported above. There are few differences among most 
models, with almost indistinguishable differences between the MRH model with all levels pruned and the MRH model with only 3 levels 
pruned or the MRH models with different values of k. (Details on the different values of k in the NPMRH models are discussed in section 
14.2.21 ) However, the extended Cox model and both PE models have higher GOF values for the first seven years when compared to the 
others. Note that the GOF statistic was not calculated for the Cox PH model as no estimated hazard rate was available, and was also 
not calculated for the DDP survival model, as we did not have access to the subject-specific survival curves. 


observation that the PH model does not seem to provide a good description of the data. 

When comparing the NPMRH models with different levels of pruning (NPMRH-0, NPMRH-3, NPMRH-6), 
the NPMRH-3 model has the lowest negative log-likelihood value, followed closely by the NPMRH-6 model. The 
NPMRH-6 model has the lowest DIG, BIG, and AIG values as it has the smallest number of estimated parameters 
of all MRH models considered. However, all three NPMRH models have very similar information criteria values, 
with the exception of BIG for NPMRH-0 whose penalty for its large number of parameters sets it apart from the 
rest of the models. It is also notable that among the NPMRH-3 models, the lowest negative log-likelihood, DIG, 
BIG, and AIG values are for the model with k = 0.2, which may be a good choice for examining the hazard rate of 
biochemical failure for this particular data set, as it captures the most details in the failure pattern. The negative 
log-likelihood values (and hence BIG and AIG calculations) of the adjusted PE models are slightly smaller than 
those of the NPMRH models, although the values are comparable. When compared to the NPH Weibull models, 
the NPMRH models all have lower negative log-likelihood values. However, BIG and AIG values are higher in the 
NPMRH models due to the higher number of estimated parameters. The AFT model has a higher negative log- 
likelihood value when compared to the other models (with the exception of the PHMRH model), and the extended 
Gox model has a slightly higher negative log-likelihood value when compared to the MRH models, but the values 
are similar. Regardless of model choice however, all evidence points to the treatment effects not being proportional: 
the effects of an additional 24 months of AD therapy change over the entire length of the study. 

5 Discussion 

This paper illustrates how different modeling and smoothing assumptions effect the estimate of the time-varying 
treatment effect. We present results from a suite of models ranging from parametric to non-parametric, and demon¬ 
strate that different assumptions can lead to very smooth, flat log-hazard ratio estimates (such as those in the NPH 
Weibull model) to estimates which vary more over time (such as those in the MRH, PE, and the DDP survival 
model). Additionally, the different models exhibited a high degree of variability in the goodness-of-fit measure and 
the penalized goodness of fit criteria. We have also shown how choosing different values of k gives the MRH model 
the flexibility to perform similarly to other models, ranging from the piece-wise exponential to the parametric Weibull 
model. The NPMRH model allows for multiple changes in the treatment effects over time, with multiple increases 
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Model 

-2Hog(L) 

Effective JN umber 
of Parameters 

DIG 

BIC 

AIC 


PHMRH 

12628.0 

32 

9651.1 

12860.3 

10555.8 


NPMRH-U 


4703.1 

139 

4751.5 

5712.1 

4981.1 


NPMRH-3 


4669.8 

43 

4665.0 

4981.9 

4755.8 

MRH 

NPMRH-6 


4679.0 

38 

4602.1 

4954.8 

4755.0 


NPMRH-3 (fc = 0.2) 

4667.9 

43 

3582.9 

4980.1 

4753.9 


NPMRH-3 (k = 0.5*) 

4669.8 

43 

4665.0 

4981.9 

4755.8 


NPMRH-3 

'k = 1.0) 

4700.7 

43 

4298.7 

5012.9 

4786.7 


NPMRH-3 

^k = 10) 

4792.8 

43 

4578.2 

5105.0 

4878.8 


Equal bins 


4611.1 

42 

- 

4916.0 

4695.1 


Quantile bins 

4596.7 

56 

- 

5003.2 

4708.7 

NPH V 

Wibull 

4759.9 

11 

- 

4839.7 

4781.9 


5277.9 

- 

- 

- 

- 

Extended Gox 

4747.0 

- 

- 

- 

- 


Table 3: Information criteria (DIC-when applicable, BIC, and AIC) for the 4 MRH models (PHMRH, NPMRH-0, NPMRH-3, and 
NPMRH-6), the NPHMRH-3 model with different fixed values of k (in the Rm,p prior), the non-proportional hazards (NPH) Weibull 
model, the adjusted piece-wise exponential models (adjusted by allowing bins to be merged), the AFT model, and the smoothed extended 
Cox model. In addition, the values of twice the negative log-likelihood {—2*log{L)) and the effective number of parameters (when known) 
are shown. Lower DIG, BIC, and AIC values represent models better supported by the data. Details on the different values for k in the 
NPMRH models will be discussed in section [4. 2.2 1 The DDP survival model is not included as the log-likelihood and information criteria 
values are not available from the fitted model. 


and decreases over the length of a study period. 

Other patient an d disease characteristic covariate effects were similar to those previously seen in this trial 
( Horwitz et al. . 20081) and expected based on the effects of these factors in other studies. Men with higher Gleason 
scores had greater hazard of biochemical failure, although this difference was statistically signihcant only for those 
with Gleason scores of 8 or more. In addition, those with more advanced tumor stage (T-stage 3 or 4) or with 
higher PSA level at diagnosis also had a higher hazard rate of biochemical failure. Men who were older at diagnosis 
were found to have a lower hazard rate of biochemical failure, although this may be still partly confounded with the 
censoring patterns in older patients and warrants further exploration. 

Additionally, the presented analysis has allowed insight into the effects of the duration of AD therapy on biochem¬ 
ical failure, and in particular into how the effects of AD therapy changed throughout the course of the study. While 
it was already apparent that 24 months of additional AD therapy is beneficial (relative to the 0 ad ditional months 
of AD therapy) in that it prolongs the time until biochemical failure and other failure endpoints ( Horwitz et al.l . 
20081 ) . our investigation has revealed additional insights. During and immediately after active therapy, the peak in 


the hazard rate around two years is much flatter for the -|-24m treatment group. In addition, the -1-24 month group 
continued to have a lower hazard rate throughout most of the observation period (over 10 years), although smaller 
due to the non-proportionality of the treatment effect. Thus, it does appear that the benefits of the additional months 
of AD therapy, while diminishing over time, are persistent, which suggests that failure in the longer AD duration 
group are not simply deferred but possibly avoided. On the other hand, for those patients who received short AD 
therapy and did not fail early or during the peak period of failures, their late term prognosis is nearly as favorable 
as those who underwent long duration AD. Thus, until such patients can be prospectively identified, the long AD 
approach would seem to be preferred for all patients. To this end, we also illustrate how the Bayesian approach can 
allow the use of posterior predictive failure probabilities, such as in Figure [3l as aids in clinical contexts. 


Appendix A: Details on the MRH prior and Pruning Method 

MRH prior 

The foundation of the MRH method is a tree-like wavelet-based multi-resolution prior on the hazard function, chosen 
conveniently to allow scalability and consistency across different time scales (i.e minutes, weeks, years, etc). It uses 
a piece-wise constant approximation of the hazard function over J time intervals, parametrized by a set of hazard 
increments dj,j = I,..., J. Here, each dj represents the aggregated hazard rate over the time interval, ranging 
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from In the standard survival analysis notation, dj = ^ h{s)ds = H{tj) — where h{t) is the 

hazard rate at time t. 

To facilitate the recursive diadic partition of the multiresolution tree, we assume that J = 2^. Here, M is an 
integer, set large enough to achieve the desired time resolution for t he hazard rate. M can also be chosen using model 
selection criteria or clinical input, as in Bouman et al. ( 2005 ). and Dignam et al. ( 2009ll . Note that the cumulative 
hazard, H, is equal to the sum of all 2^ hazard increments dj,j = 1,... 2^. The model then recursively splits H 
at different branches via the “split parameters” Rm,p = Hm, 2 p/ to = 1 , 2 ,..., M — 1, p = 0,..., 2”^“^ — 1 . 
Here, H^ q is recursively defined as Hm,q = Hm+i, 2 q + Hm+i, 2 q+i (with iJo,o = and g = 0,..., 2™ — 1). The 


R 


m.p 


split parameters, each between 0 and 1 , guide the shape of the a priori hazard rate over time (Figure [HI)- 


The complete hazard rate prior specification is obtained via priors placed on all tree parameters: a Gamma(a, A) 
prior is placed on the cumulative hazard H, and Beta prior on each split parameter Rm,p, Re(2jm^pk^a, 2(1 — 
7 m,p)fc’"a)- For example, the priors for H and Rm,p in 3-level MRH model (M = 3, J = 8 ) would be: 


H 

Rip 

R2,p 

R3,p 


Ga{a, A), 

Be{2'ji^oka, 2(1 - 'yi^o)ka), 
Be{2j2,pk‘^a, 2(1 - 72,p)fc2a), p = 0,1 
He(273,p/c^a, 2(1 - 73 ,p)fc3a), p = 0,1, 2, 3. 


( 2 ) 


Under this parametrization, the prior distribution of each hazard increment is governed by these Beta and Gamma 
distributions. In particular, their prior expectations depend on the hyperparameters of the Beta and Gamma 
priors - for example, in the above 3-level model E{di) = E{H)E{Ri^Q)E{R 2 p)E{R^fi). Similarly, these MRH 
hyperparameters control the correlation b etween the haza r d incr emen ts dj , and thus dire ctly relate to the smoothness 
of the multiresolution prior, as shown in Bouman et al.l ( 2005ll and iGhen et al.l (l2014ll. This parametrizatio n also 
insures the self-consistency of the MRH prior at multiple resolutions ( Bouman et al.l . r2005HGhen et al.l . 12014 ). 


Pruning the MRH tree 

The MRH prior resolution is chosen as a compromise between the desire for detail in the hazard rate, and the amount 
of data. As the resolution increases (and the number of time intervals increases), observed failure counts within each 
bin will decrease. While useful for revealing detailed patterns, a large number of intervals (and consequently, a 
large number of mod el parameters) will generally require longer computing t i mes a nd result in estimators with lower 
statistical efficiency ( Ghen et al. . 20141) . “Pruning”, as used in Ghen et ahl ( 20141 ). is a data-driven pre-processing 
technique, which combines consecutive Hm,pS that are statistically similar (and happens frequently with periods of 
low failure counts). The technique increases the computational efficiency by decreasing the parameter dimension 
a priori^ which can greatly speed up analyses of non-proportional hazards. The pruning method thus changes the 
overall time resolution of the MRH prior, keeping the higher resolution during the periods of high event counts, and 
lower resolution during periods of low event counts. 

The MRH pruning technique has been extensively studied in iGhen et al.l ( 20141 ). Briefly, pruning starts with the 
full MRH tree prior, and merges adjacent bins that are constructed via the same split parameter, Rm,p, when the 
hazard increments in these two bins (i?m+i, 2 p and Hm+i. 2 p+i) are statistically similar. This is inferred by testing 
the hypothesis Hq : Rm,p = 0.5 against the alternative Ha : Rm,p ^ 0.5, with a pre-set type I error a, for each split 
parameter Rm,p {p = 0,...,2’"“^ — 1). If the null hypothesis is not rejected, that split Rm,p is set to 0.5 and the 
adjacent hazard increments are considered equal and the time bins declared “fused”. The hypothesis testing can be 
applied to all M levels of the tree or just a higher resolution subset of the tree. While the pruning is expected to 
reduce the amount hazard rate detail discovered by the MRH method, the posterior hazard rate estimator is shown 
to have lower risk compared to its equivalent from the non-pruned model (|Chen et al.L 120141) . 


Appendix B: Estimation and Pruning Steps 

The estimation algorithm is performed two steps: the pruning step and the Gibbs sampler routine. The details are 
listed below. 
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Figure 9: Example of an MRH prior mean (black), centered at a desired parametric hazard rate (yellow), at various resolution 
levels m. The first figure (upper left) shows the mean MRH rate at the first level (m = 1), with E{Hi^o) = E{H) * E[Ri^q) and 
E{Hi^i) = E{H) *E{1 — Ri,q). Ri,q prior mean of 0.76 reflects the higher hazard during the first time interval. The second figure (upper 
right) shows the mean MRH rate at the second level (m = 2), with E{H 2 ,o) = E{H) * E{Ri^q) * E{R 2 ,q), and E{H 2 ,i), E{H 2 , 2 ) ^-nd 
-£'(^ 2 , 3 ) derived analogously. The third figure (lower left), shows the mean MRH rate at the third level (m = 3). The last figure (lower 
right), shows the mean MRH rate for m = 6, which closely matches the true hazard rate. In all the flgures, the mean MRH rates for the 
previous resolutions are shown in gre y, E(Rm.v) sh own in green, and ^(1 — Rm,p) in red. The advantage of the MRH model is its 
“self-consistency” under aggregation dChen et al.L [20l3) . which means that the prior speciflcation at any level m is independent of the 
ultimate level M. This property allows the hazard rate to be examined at multiple resolutions in a consistent manner. 
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Pruning step 

The pruning step is run only once for each of the L hazard rates at the beginning of the algorithm as a pre-processing 
step in order to finalize the MRH tree priors. The Rm,p-/ parameters for which the null hypothesis is not rejected 
are set to 0.5 with probability 1, while the rest are estimated in the Markov chain Monte Carlo (MCMC) routine. 

Gibbs sampler steps 

After the pruning step, the Gibbs sampler algorithm is performed to obtain the approximate posterior distribution 
of Of, Af, kn, 7 £, and the Rm,p-,i that have not been set to 0.5 for each stratum (£ = !,...,£) as well as /3. 

The algorithm is as follows, with steps repeated until convergence: 

1. For each of the C treatment hazard rates (£ = !,...,£): 

(a) Sample from the posterior for which is a gamma density with the shape parameter Of+ 

and rate parameter A^^ -I- exp Fi{Ti^i), where Ft^T^) = 

(b) Sample a^, A^ from their respective posterior distributions (see below). 

(c) Sample each Rm.p-,i for which the null hypothesis was rejected from the full conditional: 

exp (-exp (x'^) }. 

(d) Sample ke-,"fm,p;e from their respective posterior distributions (see below). 

2. With a A/'(0, ) prior (with a known variance) on each covariate effect modeled under the proportional hazards 

assumption, /Sg (s = 1,..., z), each has the following full conditional distribution: 

tt{I3s\Pj) oc (n^^illigtx^ [exp{Xy7/34]^’''exp|-exp i?£F>(Ti 7 )|^ exp 

Note that this posterior distribution includes the full set of observations and covariates, from all strata jointly. 



Full conditionals for the hyperparameters a, A, k, and 7m,p 


The parameters in the prior distributions of and all Rm.pS for each covariate stratum (£ = 1,..., £), ai, Xi, ki, 
and can either be fixed at desired values, or treated as random variables with their own set of hyperpriors. 

In the case of the latter, they would be sampled within the Gibbs sampler separately for each stratum, according to 
their own full conditional distributions. Below are the forms of these full conditional distributions for a specific set 
of hyperpriors we chose. 

For notational simplicity, the stratum-specific index is suppressed below. The notation rj~ will be used to denote 
the set of all data and all parameters except for the parameter rj itself. The full conditionals are as follows: 


• If a is given a zero-truncated Poisson prior, 
conditional distribution for a is: 


e 

a! (I — e“'^“) 


(chosen for computational convenience), the full 


7r(a I a ) oc 


Tja ,,a 
R k-a 

A“(a — I)!a! 


n 


M 

m —1 



B(27m_pfc™a, 2(1 - ^m,p)k'^a) J 


• If the scale parameter A in the gamma prior for the cumulative hazard function F[ is given an exponential prior 
with mean /i^,, the resulting full conditional is: 

MA|A-)«^exp{-(| + A)} 
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• If /c is given an exponential prior distribution with mean the full conditional distribution for k is as follows: 


7r(fc I fc-) oc 


"-1 


B{2jjn,pk"^a, 2(1 - ^m,p)k^a) J ^ 


• If a Beta(M, w) prior is placed on each 7m,p, the full conditional distribution for each 7m,p is proportional to: 


D27m,pfc”*a/, „ \2(l-7m,p)fe™. 

-^m.p V-*- ^m,p) 

B(27m,pfc'"a, 2(1 - 'ym,p)k^a) 


7m,p (1 '7m,,p) 


—1 
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